Firstly, we import the data and do data cleanning. We drop the variable Trades and %Deliverable. Also, we transform the Date variable from string type to date type to treat the whole data set as time series data set.
Since some stocks changed thier name/symbols during this time of period, we need to fix the inconsistency problem and merge the spilted data together.
import delimited NIFTY50_all.csv, clear
* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"
* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"
* Save the cleaned data
save NIFTY_clean, replace
Then we visualize the data and the stock “ADANIPORTS” is taken as an example.
use NIFTY_clean, clear
keep if symbol == "ADANIPORTS"
graph twoway line vwap date, color("blue") xtitle("Days") ///
ytitle("Volume weighted average price")
graph export vwap_date.png, replace
graph twoway line volume date, color("blue") xtitle("Days") ytitle("Volume")
graph export volume_date.png, replace
graph twoway line turnover date, color("blue") xtitle("Days") ytitle("Turnover")
graph export turnover_date.png, replace
We will use the time series VWAP for the analysis below.
For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.
use NIFTY_clean, clear
local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"
foreach sym of local sbls_f5 {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
dfuller d1.vwap
}
We do the test on the vwap with the first-order differentiation. All stocks are reporting minimum p-values, hence we decide to use \(d=1\) for all stocks.
Then, in order to find AR parameter \(p\) of the model, we generate the partial autoregressive (PACF) plots together with autoregressive (ACF) plots. Here, the parameter \(p\) represents the number of lags of this model. We only consider relationships for one variable and \(p\) variables beyond it. The MA parameter \(q\) has exactly the same meaning as AR models.
Note: we will only plot the first 5 stocks as an example.
use NIFTY_clean, clear
foreach sym of local sbls_f5 {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
ac vwap
graph export acf_`sym'.png
pac vwap
graph export pacf_`sym'.png
}
The PACF plots for these stocks are the following:
And the ACF plots for the these 5 stocks are the following:
We can get the similar conclusion that lag 1 is absolutely significant while lag 2 is not, hencewe can choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.
According to the process above, we choose the \(ARIMA(1, 1, 1)\) (where the first parameter is \(p\) , the second is \(d\) and the third is \(p\)) for all stocks. However, diagnostics tells sometimes the \(ARIMA(1, 1, 0)\) performs better for some stocks. Hence, we try to use the better model to fit the data and then plot the predicted values against original values.
Note: we will only plot the first 5 stocks as an example.
use NIFTY_clean, clear
local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"
foreach sym of local sbls_f5 {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
arima vwap, arima(1,1,1)
estat ic
mat l_aim = r(S)
scalar aic_aim = l_aim[1,5]
arima vwap, arima(1,1,0)
estat ic
mat l_ai = r(S)
scalar aic_ai = l_aim[1,5]
if aic_aim > aic_ai {
tsappend, add(200)
arima vwap, arima(1,1,0)
predict vwap_pd
gen vwap_p = vwap_pd + vwap
replace vwap_p=vwap_p[_n-1]+ vwap_pd[_n] if _n > _N - 200
graph twoway line vwap date, lwidth("vthin") color("blue") || line ///
vwap_p date, lwidth("vthin") color("red") lpattern("dash")
graph export fitted_`sym'.png, replace
}
else {
tsappend, add(200)
arima vwap, arima(1,1,1)
predict vwap_pd
gen vwap_p = vwap_pd + vwap
replace vwap_p=vwap_p[_n-1]+ vwap_pd[_n] if _n > _N - 200
graph twoway line vwap date, lwidth("vthin") color("blue") || line ///
vwap_p date, lwidth("vthin") color("red") lpattern("dash")
graph export fitted_`sym'.png, replace
}
}
The regression coefficient is the following:
ADANIPORTS
ASIANPAINT
AXISBANK
BAJAJ-AUTO
BAJAJFINSV
Also, the out-of-sample prediction is implemented here. we tried to predict the tendency of the stoch price in next 200 trading days and he sample fitted graphs are:
Now that we chose different models for different stocks, we can further improve the models by choosing the most proper model for each stock.
However, Stata does not have some similar funciton as auto_arima to choose models automatically. Hence, we may related to other two languages ( Python, R). Heavy and tedious computation is expected in Stata here.